Performability Modeling of Coordinated Software and Hardware Fault Tolerance

Authors

  • Ann T. Tai
  • William H. Sanders
Abstract

Quantitative system evaluation that accounts for both software design faults and hardware operational faults has not yet received enough attention. Although a few studies have considered such dependability analysis, analytic evaluation of systems with integrated software and hardware fault tolerance remains a challenge. In particular, the need to distinguish the effects of software design faults from those of random hardware faults often makes the problem very complex. Such evaluation may become even more difficult when we are concerned with 1) distributed computing environments, 2) results other than instantaneous steady-state measures, 3) fault tolerance beyond simple application of resource redundancy, and/or 4) combined assessments of performance and dependability, such as performability.

In this paper, we tackle the problem of performability evaluation for a scheme of coordinated software and hardware fault tolerance we developed earlier [1]. The scheme involves 1) a time-based (TB) checkpointing protocol developed by Neves and Fuchs for tolerating hardware faults, and 2) our message-driven confidence-driven (MDCD) protocol for software error containment and recovery. The two protocols coordinate through their checkpointing activities, guaranteeing that when recovering from a software or hardware error in a distributed computing environment, the system will always reach a global state that is not only consistent but also valid (i.e., it does not reflect the receipt of not-yet-validated messages from low-confidence processes).

The system in question comprises two application processes that interact via message passing. One of the processes is created from a low-confidence software component; this process, call it P1, is escorted by the MDCD protocol, which lets a high-confidence version of P1 run in the background to enable error recovery. Thus the system has two active interacting processes, P1 and P2 (the second application process, which is a high-confidence component), and a shadow process, the high-confidence version of P1.

The protocol coordination scheme emphasizes avoiding potential interference between software and hardware fault tolerance techniques and enabling them to be mutually supportive. Specifically, to enable software error containment and recovery, the MDCD protocol establishes a checkpoint in volatile storage whenever a message-passing event lowers our confidence in the correctness of a process state. The duty of the TB protocol is to save consistent and valid checkpoints to stable storage, based on periodically resynchronized timers, to tolerate transient hardware faults. Moreover, to ensure global state consistency, the TB protocol requires processes to start a blocking period of a minimum length (during which a checkpoint is saved to stable storage) upon timer expiration. Since the required blocking period is an increasing function of clock drift, which in turn is an increasing function of the elapsed time since clock synchronization, the system must periodically undergo resynchronization to prevent the blocking period from becoming longer than the time required to save a checkpoint to stable storage.

When a process fails to pass an acceptance test (AT), software error recovery is invoked: a process rolls back to its volatile-storage checkpoint if its dirty bit equals 1; otherwise, the process rolls forward. However, when a process's host encounters a transient hardware error, all the processes roll back to their stable-storage checkpoints. The coordination scheme does not rely on costly message exchange among the participating protocols, which preserves the performance advantages of the original MDCD and TB protocols. Accordingly, it is important to verify that the scheme indeed enhances a system's performability.
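
The recovery rules stated in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Process`, `Action`, `recover_from_software_error`, and `recover_from_hardware_error`, as well as the process labels in the usage example, are hypothetical and chosen only for illustration. The sketch encodes just the two decisions described above: after a failed acceptance test, each process consults its dirty bit; after a transient hardware error, every process returns to its stable-storage checkpoint.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ROLL_BACK_VOLATILE = "roll back to volatile-storage checkpoint"
    ROLL_FORWARD = "roll forward"
    ROLL_BACK_STABLE = "roll back to stable-storage checkpoint"


@dataclass
class Process:
    # Hypothetical record; the abstract only says each process has a dirty bit.
    name: str
    dirty_bit: int  # 1 means the state may reflect not-yet-validated messages


def recover_from_software_error(processes):
    """Decide each process's action after a failed acceptance test (AT)."""
    return {
        p.name: Action.ROLL_BACK_VOLATILE if p.dirty_bit == 1 else Action.ROLL_FORWARD
        for p in processes
    }


def recover_from_hardware_error(processes):
    """After a transient hardware error, all processes use stable storage."""
    return {p.name: Action.ROLL_BACK_STABLE for p in processes}


if __name__ == "__main__":
    # Illustrative dirty-bit values only; they are not taken from the paper.
    procs = [Process("P1_shadow", dirty_bit=0), Process("P2", dirty_bit=1)]
    for name, action in recover_from_software_error(procs).items():
        print(f"software error: {name} -> {action.value}")
    for name, action in recover_from_hardware_error(procs).items():
        print(f"hardware error: {name} -> {action.value}")
```

The two functions are kept separate to mirror the division of labor in the scheme: the dirty-bit test belongs to software error recovery, while hardware error recovery ignores the dirty bit entirely.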

Similar resources

Towards Performability Modeling of Software Rejuvenation

In this paper, we discuss issues in performability modeling of "software rejuvenation," a form of software fault tolerance based on occasionally cleaning up the operational environment. System factors which play a key role in such a model are identified. Among these, we comment on two issues of particular interest when modeling software rejuvenation: (1) the representation of the degradation in ...

Performability Modeling of Exceptions-Aware Systems in Multiformalism Tools

Exceptions constitute a widely accepted fault tolerance mechanism, suitable to manage both hardware and software faults. In performability analysis it is a common practice to exploit software tools capable of describing a system using models expressed in various formalisms. Often these tools provide extensibility features that allow augmenting the primitives of a given formalism, but in most ca...

On Performability Modeling and Evaluation of Software Fault Tolerant Structures

An adaptive scheme for software fault-tolerance is evaluated from the point of view of performability, comparing it with previously published analyses of the more popular schemes, recovery blocks and multiple version programming. In the case considered, this adaptive scheme, "Self-Configuring Optimistic Programming" (SCOP), is equivalent to N-version programming in terms of the probability of d...

Dependable LQNS: A Performability Modeling Tool for Layered Systems

Dependable LQNS is a software tool for modeling and evaluating performability of fault-tolerant layered distributed applications that use a separate architecture for failure detection and reconfiguration. It takes into account the effects of management architecture, application software architecture, failure of management and application components in the dependability computation. It uses a co...

Layered Dependability Modeling of an Air Traffic Control System

Quality attributes, such as performance and dependability of a software-intensive system are constrained by its software architecture. The combined performance and dependability (called performability) effects of an architecture can be evaluated by constructing a performability model that considers the failure/repair behavior and performance attributes of its components, interactions among the ...

Publication date: 2003